[multi-gpu] Phase 1: namespace channel_type, add cross-rank attrs, doc plan by erwei-xilinx · Pull Request #1576 · Xilinx/mlir-air

erwei-xilinx · 2026-05-03T17:10:45Z

First step toward multi-GPU messaging support. Pure IR/dialect changes — no lowering yet (Phases 2–7 land separately as #1577–#1582).

Summary

`channel_type` namespace rename (Option 1)

Existing values gain an `npu_` prefix to make backend scope explicit:

Before	After
`dma_stream` (default)	`npu_dma_stream`
`dma_packet`	`npu_dma_packet`
`cascade`	`npu_cascade`
`mmio`	`npu_mmio`

Mechanical rename across 33 files (verifier, transform/conversion passes, all .mlir tests, Python programming examples).
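To illustrate what the mechanical rename amounts to, here is a minimal sketch of a text rewrite driven by the table above. The helper name `rename_channel_types` and the regex are hypothetical and are not the script (if any) used for this PR; the sketch only assumes that values appear as a quoted string after `channel_type =`.

```python
import re

# Old -> new channel_type values, taken from the rename table above.
CHANNEL_TYPE_RENAMES = {
    "dma_stream": "npu_dma_stream",
    "dma_packet": "npu_dma_packet",
    "cascade": "npu_cascade",
    "mmio": "npu_mmio",
}

# Match only whole quoted values that directly follow `channel_type =`, so
# values that are already renamed (e.g. "npu_cascade") are left untouched.
_PATTERN = re.compile(
    r'(channel_type\s*=\s*")(' + "|".join(CHANNEL_TYPE_RENAMES) + r')(")'
)


def rename_channel_types(text: str) -> str:
    """Rewrite pre-rename channel_type string values (hypothetical helper)."""
    return _PATTERN.sub(
        lambda m: m.group(1) + CHANNEL_TYPE_RENAMES[m.group(2)] + m.group(3),
        text,
    )
```

Because the pattern anchors on the opening quote, running the rewrite twice gives the same result as running it once, which matters when tests from other branches are rebased onto the rename.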

New GPU multi-rank channel type

gpu_symmetric_heap: cross-rank channel through the symmetric heap runtime (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires put/get sites to be inside an air.rank scope.
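The "inside an air.rank scope" rule is an ancestor check. The following is a toy model only, assuming a simplified `Op` class with a parent link; the real verifier is C++ in AIRDialect.cpp, and the error wording here is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """Toy stand-in for an IR operation with a link to its parent op."""

    name: str
    parent: Optional["Op"] = None


def inside_rank_scope(op: Op) -> bool:
    """Return True if any ancestor of `op` is an air.rank op."""
    cur = op.parent
    while cur is not None:
        if cur.name == "air.rank":
            return True
        cur = cur.parent
    return False


def verify_channel_access(op: Op, channel_type: str) -> Optional[str]:
    """Return an error string (hypothetical wording), or None if accepted."""
    if channel_type == "gpu_symmetric_heap" and not inside_rank_scope(op):
        return f"{op.name}: gpu_symmetric_heap access must be inside air.rank"
    return None
```

Only `gpu_symmetric_heap` accesses are gated in this model; the `npu_*` channel types are accepted regardless of scope.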

`air.dma_memcpy_nd` cross-rank addressing

Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the enclosing `air.rank` scope.
Verifier requires:
- an enclosing `air.rank` scope
- the peer-side memref's `memref.alloc` (when directly available) to carry the `air.symmetric` attribute

A backward-compatible custom builder lets existing callers compile unchanged.
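The two verifier rules can be modeled roughly as below. This is a sketch under stated assumptions, not the C++ verifier: it assumes that `src_rank` makes the source memref the peer side and `dst_rank` makes the destination memref the peer side, it represents "alloc not directly available" as `None`, and the error strings are hypothetical.

```python
from typing import List, Optional


def verify_cross_rank_dma(
    src_rank: Optional[int],
    dst_rank: Optional[int],
    in_rank_scope: bool,
    src_alloc_symmetric: Optional[bool] = None,
    dst_alloc_symmetric: Optional[bool] = None,
) -> List[str]:
    """Toy model of the cross-rank rules for air.dma_memcpy_nd.

    `*_alloc_symmetric` is None when the memref is not directly produced by
    a memref.alloc (the check is skipped); otherwise it records whether that
    alloc carries air.symmetric.
    """
    errors: List[str] = []
    # No rank attribute at all: an ordinary DMA, nothing extra to verify.
    if src_rank is None and dst_rank is None:
        return errors
    if not in_rank_scope:
        errors.append("src_rank/dst_rank require an enclosing air.rank")
    if src_rank is not None and src_alloc_symmetric is False:
        errors.append("src memref alloc must carry air.symmetric")
    if dst_rank is not None and dst_alloc_symmetric is False:
        errors.append("dst memref alloc must carry air.symmetric")
    return errors
```

Making both attributes optional with a default of `None` mirrors the backward-compatibility goal: a call that passes neither rank behaves exactly as before.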

`air.symmetric` memref attribute

A unit attribute on memref.alloc indicating the allocation should be backed by the symmetric heap. Documented in docs/AIRComputeModel.md §2.7.

Documentation

docs/AIRComputeModel.md updated to describe the new IR surface:

§2.4 cross-rank addressing on air.dma_memcpy_nd
§2.5 channel_type table including the npu_* rename and gpu_symmetric_heap
§2.7 air.symmetric memref attribute
§5 summary table updated for cross-rank / multi-GPU concepts

Test plan

- All 21 mlir/test/Dialect/AIR/ tests pass (positive round-trip + verifier negatives)
- New air_cross_rank_dma.mlir: round-trip for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel inside air.rank
- air_channel_invalid.mlir: gpu_symmetric_heap put/get outside air.rank rejected
- air_memcpy_invalid.mlir: src_rank/dst_rank outside air.rank rejected, missing air.symmetric on alloc rejected
- CI clang-format / black format
- AIE backend regression (covered by CI)
- GPU end-to-end: Phases 2–7 (#1577 "[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU" through #1582 "[multi-gpu] Phase 7: aircc integration (--multi-gpu flag)") build on this PR; all 5 INPUT variants pass at W=2/4/8 on rad-mi325x-1

🤖 Generated with Claude Code

Apply clang-format-17 reflow to three .cpp files (text-string wrapping across the renamed channel_type values "npu_mmio" / "npu_cascade" / "npu_dma_stream") and black reformat to one .py file (npu_cascade arg list now exceeds the line limit). These were reported by the lintAndFormat workflow on PR Xilinx#1576; this commit folds them into Phase 1 so the diff CI saw is what's now in tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR implements Phase 1 of multi-GPU messaging support by extending the AIR IR surface. It namespaces the existing NPU channel types, adds a GPU-specific symmetric-heap channel type, and introduces cross-rank addressing attributes plus an air.symmetric allocation marker, along with the corresponding verifier rules, tests, and documentation updates.

Changes:

Renames existing channel_type values to npu_* to make backend scope explicit.
Adds gpu_symmetric_heap channel type (rank-scoped) and cross-rank src_rank/dst_rank attributes on air.dma_memcpy_nd gated by air.rank + air.symmetric.
Updates verifier logic, MLIR tests, examples, and compute model documentation to cover the new IR surface.

Reviewed changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 6 comments.

File	Description
programming_examples/matrix_vector_multiplication/bf16_cascade/matvec_cascade.py	Updates cascade channel example to use `npu_cascade`.
programming_examples/herd_dataflow/run.py	Updates default and cascade channel_type strings to `npu_*`.
programming_examples/herd_dataflow/air.mlir	Updates channel declarations/comments to `npu_*` naming.
programming_examples/flash_attention/kernel_fusion_based/attn_npu2.py	Renames cascade channels to `npu_cascade`.
programming_examples/flash_attention/kernel_fusion_based/attn_npu1.py	Renames cascade channels to `npu_cascade`.
programming_examples/flash_attention/dataflow_based/attn.py	Renames cascade channel attribute to `npu_cascade`.
programming_examples/channel_examples/mmio/mmio.py	Renames mmio channel type to `npu_mmio` and updates docstring.
programming_examples/channel_examples/dual_herd_packet_switch/dual_herd_packet_switch.py	Updates comment to refer to `npu_dma_packet`.
programming_examples/channel_examples/channel_3d_segment_unroll/channel_3d_segment_unroll.py	Renames cascade channel to `npu_cascade` and reformats call.
programming_examples/cascade_reduction/cascade_reduction.py	Renames cascade channel to `npu_cascade`.
mlir/test/Transform/AIRMiscPasses/air_split_l2_memref.mlir	Updates FileCheck expectations to `npu_dma_packet`.
mlir/test/Transform/AIRMiscPasses/air_collapse_herd.mlir	Updates cascade channel types to `npu_cascade`.
mlir/test/Transform/AIRHerdPlacement/cascade_placement.mlir	Updates cascade channel declarations to `npu_cascade`.
mlir/test/Transform/AIRDmaToChannel/dma_to_channel_no_auto_packet.mlir	Updates negative checks to `npu_dma_packet`.
mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet.mlir	Updates expected upgraded channel types to `npu_dma_packet`.
mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_single_herd.mlir	Updates expected upgraded channel types to `npu_dma_packet`.
mlir/test/Transform/AIRDmaToChannel/dma_to_channel_auto_packet_broadcast.mlir	Updates broadcast upgrade expectations to `npu_dma_packet`.
mlir/test/Transform/AIRDependencyScheduleOpt/fuse_channels.mlir	Updates stream/packet channel types to `npu_*` for non-fusion test.
mlir/test/Dialect/AIR/air_memcpy_invalid.mlir	Adds verifier-negative tests for cross-rank `src_rank`/`dst_rank` and missing `air.symmetric`.
mlir/test/Dialect/AIR/air_cross_rank_dma.mlir	New round-trip tests for cross-rank DMA attrs, `air.symmetric`, and `gpu_symmetric_heap`.
mlir/test/Dialect/AIR/air_channel.mlir	Updates channel type round-trips and adds `gpu_symmetric_heap` parse/print coverage.
mlir/test/Dialect/AIR/air_channel_invalid.mlir	Updates allowlist diagnostic and adds verifier negatives for `gpu_symmetric_heap` outside `air.rank`.
mlir/test/Dialect/AIR/air_canonicalize.mlir	Updates cascade channel type to `npu_cascade` in canonicalization test.
mlir/test/Conversion/ConvertToAIR/scf_parallel_to_herd.mlir	Updates cascade channel check to `npu_cascade`.
mlir/test/Conversion/AIRToAIE/shim_pkt_channel_sharing.mlir	Updates packet channels to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/shim_packet_flow_npu.mlir	Updates packet channel types to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/shared_shim_channel_packet_ids.mlir	Updates packet channel declarations to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/segment_unroll_packet_flow_ids.mlir	Updates intra-device packet channels to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/good_shim_packet_flow_npu_4col.mlir	Updates packet channel to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/bad_shim_packet_flow_npu_1col.mlir	Updates packet channel to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/air_shimcpy_to_npu.mlir	Updates multiple packet channel types to `npu_dma_packet`.
mlir/test/Conversion/AIRToAIE/air_channel_to_locks_core_to_core.mlir	Updates cascade channel declarations to `npu_cascade`.
mlir/test/Conversion/AIRToAIE/air_channel_mmio.mlir	Updates mmio tests to `npu_mmio` and stream default to `npu_dma_stream`.
mlir/test/Conversion/AIRToAIE/air_channel_mmio_invalid.mlir	Updates mmio-negative diagnostics and channels to `npu_mmio`.
mlir/lib/Util/Util.cpp	Changes default inferred channel type to `npu_dma_stream`.
mlir/lib/Transform/AIRMiscPasses.cpp	Updates cascade detection to `npu_cascade`.
mlir/lib/Transform/AIRLinalgCodegen.cpp	Updates generated channel default to `npu_dma_stream`.
mlir/lib/Transform/AIRHerdPlacementPass.cpp	Updates cascade channel collection to `npu_cascade`.
mlir/lib/Transform/AIRDmaToChannel.cpp	Updates created/upgraded channel types to `npu_*` and mmio exclusion to `npu_mmio`.
mlir/lib/Dialect/AIR/IR/AIRDialect.cpp	Adds cross-rank DMA verification, enforces rank-scope for `gpu_symmetric_heap` put/get, and updates channel_type allowlist to namespaced values.
mlir/lib/Conversion/ConvertToAIRPass.cpp	Updates cascade channel creation to tag `npu_cascade`.
mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp	Updates internal resource type strings to `npu_*` and mmio handling to `npu_mmio`.
mlir/lib/Conversion/AIRToAIEPass.cpp	Updates mmio gating and resource-type branching to `npu_*` names.
mlir/include/air/Dialect/AIR/AIR.td	Adds `src_rank`/`dst_rank` attrs, changes default channel_type to `npu_dma_stream`, and documents `gpu_symmetric_heap`.
docs/AIRComputeModel.md	Documents cross-rank DMA attrs, namespaced channel types, and `air.symmetric` attribute; updates summary tables accordingly.

Six Copilot comments on PR Xilinx#1576:

1. AIRToAIESchedulingUtils.cpp: four diagnostic strings still said "dma_stream / dma_packet" after the rename to "npu_dma_stream / npu_dma_packet". Updated.
2. docs/AIRComputeModel.md (cross-rank DMA, §2.4): said the GPU backend lowers src_rank/dst_rank, contradicting the summary table that calls it "planned". Reworded as "planned: air-cross-rank-dma-to-mgpu" to match.
3. docs/AIRComputeModel.md (air.symmetric, §2.7): same inconsistency for mgpuSymmetricAlloc routing. Reworded as "planned: air-symmetric-alloc-to-mgpu".
4. AIR.td (DmaMemcpyNdOp description): same inconsistency. Reworded.
5. AIR.td (gpu_symmetric_heap channel_type description): claimed "Lowered by air-to-rocdl to thread-cooperative loops..." with no such lowering yet in tree. Reworded as "planned: air-gpu-channel-to-mgpu".
6. AIRDialect.cpp DmaMemcpyNdOp::verify: rank indices are non-negative. Added explicit `>= 0` check, plus matching verifier-negative tests in air_memcpy_invalid.mlir for both src_rank=-1 and dst_rank=-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…c plan

Step toward multi-GPU messaging support per docs/MultiGPUPlan.md. Pure IR/dialect changes, no lowering yet.

## channel_type namespace rename (Option 1)

Existing channel_type values gain an `npu_` prefix to make backend scope explicit:

- `dma_stream` → `npu_dma_stream` (default)
- `dma_packet` → `npu_dma_packet`
- `cascade` → `npu_cascade`
- `mmio` → `npu_mmio`

Mechanical rename across 33 files (verifier, transform/conversion passes, all .mlir tests, Python programming examples).

## New channel_type for GPU multi-rank messaging

- `gpu_symmetric_heap`: cross-rank channel through the symmetric heap runtime (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires put/get sites to be inside an `air.rank` scope.

## air.dma_memcpy_nd cross-rank addressing

- New optional integer attributes `src_rank` / `dst_rank` name a peer rank in the enclosing `air.rank` scope.
- Verifier requires:
  - an enclosing `air.rank` scope
  - the peer-side memref's `memref.alloc` (when directly available) to carry the `air.symmetric` attribute
- Backward-compatible builder so existing call sites compile unchanged.

## air.symmetric memref attribute

A unit attribute on `memref.alloc` indicating the allocation is backed by the symmetric heap. Documented in docs/AIRComputeModel.md §2.7.

## Documentation

- New docs/MultiGPUPlan.md: full design and 7-phase implementation plan
- docs/AIRComputeModel.md: §2.4 cross-rank addressing, §2.7 air.symmetric, §2.5 channel_type table updated, §5 summary table updated

## Tests

- mlir/test/Dialect/AIR/air_cross_rank_dma.mlir (new): positive round-trip for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel put/get inside air.rank
- mlir/test/Dialect/AIR/air_channel_invalid.mlir: gpu_symmetric_heap put/get outside air.rank rejected; updated unsupported channel_type error message
- mlir/test/Dialect/AIR/air_memcpy_invalid.mlir: src_rank/dst_rank outside air.rank rejected; missing air.symmetric on alloc rejected

All 21 mlir/test/Dialect/AIR/ tests pass; GPU dma_copy and 4k_4k_mul e2e tests pass on MI300A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous commit (888bcaa) added a `>= 0` verifier on src_rank / dst_rank, but used `getSrcRank()` / `getDstRank()` — those return `std::optional<uint64_t>` (a TableGen quirk for `OptionalAttr<I64Attr>`), so `*sr < 0` on the unsigned value is always false and the check never fired. The two new verifier-negative tests in air_memcpy_invalid.mlir silently regressed. Switch to the typed `getSrcRankAttr()` / `getDstRankAttr()` accessors which return `IntegerAttr`, then call `.getInt()` for a real `int64_t`. The check now fires on negative values; both negative-rank tests pass under `lit -sv ../../mlir/test/Dialect/AIR` (21/21). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
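The failure mode described in this commit can be reproduced outside MLIR. The sketch below uses Python's `ctypes` purely to mimic 64-bit unsigned versus signed views of the same value; the function names are hypothetical and only stand in for the two accessor styles mentioned above.

```python
import ctypes


def as_uint64(value: int) -> int:
    """Reinterpret a signed 64-bit value as unsigned, as a uint64_t getter would."""
    return ctypes.c_uint64(value).value


def broken_is_negative(rank: int) -> bool:
    # Mirrors the faulty check: the comparison runs on the unsigned view,
    # which can never be below zero, so this always returns False.
    return as_uint64(rank) < 0


def fixed_is_negative(rank: int) -> bool:
    # Mirrors the fix: compare the signed 64-bit value instead.
    return ctypes.c_int64(rank).value < 0
```

With `rank = -1` the unsigned view is `2**64 - 1`, so the broken check accepts it, while the signed check rejects it as intended.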

origin/main grew 5 new herd-placement tests via Xilinx#1583 that use the pre-rename `channel_type = "cascade"`. After this PR's namespace rename ("cascade" -> "npu_cascade"), those tests fail under air-opt with the verifier rejecting the old name. Update them to "npu_cascade" so they keep passing on top of phase 1. Verified on rad-mi300a-sh5-1: AIRHerdPlacement 15/15 pass, Dialect/AIR 21/21 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI on 'Build and Test with AIE tools on Ryzen AI (amdhx370)' caught one more stale "cascade" reference: test/xrt/34_cascade_vecadd/run_peano.py embeds an inline MLIR string that declared `channel_type = "cascade"`. Update to "npu_cascade" to match the namespace rename. The corresponding run_chess.py variant didn't have this issue. Verifier diagnostic from the failing job: 'air.channel' op unsupported channel_type "cascade"; expected one of "npu_dma_stream", "npu_dma_packet", "npu_cascade", "npu_mmio", or "gpu_symmetric_heap" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
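For reference, the allowlist behavior behind that diagnostic can be sketched as follows. The list of supported values and the message text are taken from the diagnostic quoted above; the function name `check_channel_type` is hypothetical and this is not the C++ verifier code.

```python
from typing import Optional

# Supported values, as listed in the verifier diagnostic quoted above.
SUPPORTED_CHANNEL_TYPES = (
    "npu_dma_stream",
    "npu_dma_packet",
    "npu_cascade",
    "npu_mmio",
    "gpu_symmetric_heap",
)


def check_channel_type(channel_type: str) -> Optional[str]:
    """Return None for a supported value, else a diagnostic string."""
    if channel_type in SUPPORTED_CHANNEL_TYPES:
        return None
    quoted = [f'"{t}"' for t in SUPPORTED_CHANNEL_TYPES]
    expected = ", ".join(quoted[:-1]) + ", or " + quoted[-1]
    return (
        "'air.channel' op unsupported channel_type "
        + f'"{channel_type}"'
        + "; expected one of "
        + expected
    )
```

Any inline MLIR string that still uses a pre-rename value, such as the one in run_peano.py, fails this check, which is why stale references surface as verifier errors rather than silent misbehavior.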

erwei-xilinx force-pushed the multigpu-phase1-channel-types-and-cross-rank branch from abbc586 to 38b7e10 Compare May 3, 2026 20:21

erwei-xilinx marked this pull request as ready for review May 6, 2026 01:04

erwei-xilinx requested review from fifield and jgmelber as code owners May 6, 2026 01:04

Copilot AI review requested due to automatic review settings May 6, 2026 01:04

erwei-xilinx requested a review from eddierichter-amd as a code owner May 6, 2026 01:04

Copilot started reviewing on behalf of erwei-xilinx May 6, 2026 01:06

Copilot AI reviewed May 6, 2026

erwei-xilinx and others added 5 commits May 6, 2026 04:47

erwei-xilinx force-pushed the multigpu-phase1-channel-types-and-cross-rank branch from 90c90d6 to 965f853 Compare May 6, 2026 04:48

erwei-xilinx added this pull request to the merge queue May 6, 2026

Merged via the queue into Xilinx:main with commit fd62d7c May 6, 2026
27 checks passed

erwei-xilinx deleted the multigpu-phase1-channel-types-and-cross-rank branch May 6, 2026 16:18

erwei-xilinx merged 6 commits into Xilinx:main from erwei-xilinx:multigpu-phase1-channel-types-and-cross-rank

2 participants
